This notebook is updated with new content from time to time, so don't be surprised if some material is missing or incomplete.
import numpy as np
import pandas as pd
import os
import seaborn as sns
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
# this code specifies a path for files, since I'm working on my local machine and pushing notebooks to kaggle
BASE_PATH = ""
if 'KAGGLE_KERNEL_RUN_TYPE' in os.environ:
    BASE_PATH = '/kaggle/input/playground-series-s5e1/'
else:
    BASE_PATH = 'kaggle/input/playground-series-s5e1/'
Exploratory Data Analysis (EDA)¶
Exploratory Data Analysis lets us take a first look at the dataset and gather some initial insights.
It also helps us dig deeper into the problem, in our case sticker sales forecasting. Careful, thoughtful analysis can reveal patterns that help improve the model's overall performance.
Our objectives in this notebook:
- Gain insights from data.
- Visualize distributions, relationships and patterns.
- Identify anomalies such as outliers and missing values.
- Dive deeper into time series.
- Train a model for forecasting sticker sales.
Let's take a look at the data!
Dataset Overview¶
X_train = pd.read_csv(f'{BASE_PATH}train.csv')
X_test = pd.read_csv(f'{BASE_PATH}test.csv')
print(f'Train dataset contains {X_train.shape[0]} rows and {X_train.shape[1]} columns.')
X_train.head()
Train dataset contains 230130 rows and 6 columns.
| | id | date | country | store | product | num_sold |
|---|---|---|---|---|---|---|
| 0 | 0 | 2010-01-01 | Canada | Discount Stickers | Holographic Goose | NaN |
| 1 | 1 | 2010-01-01 | Canada | Discount Stickers | Kaggle | 973.0 |
| 2 | 2 | 2010-01-01 | Canada | Discount Stickers | Kaggle Tiers | 906.0 |
| 3 | 3 | 2010-01-01 | Canada | Discount Stickers | Kerneler | 423.0 |
| 4 | 4 | 2010-01-01 | Canada | Discount Stickers | Kerneler Dark Mode | 491.0 |
X_train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 230130 entries, 0 to 230129
Data columns (total 6 columns):
 #   Column    Non-Null Count   Dtype  
---  ------    --------------   -----  
 0   id        230130 non-null  int64  
 1   date      230130 non-null  object 
 2   country   230130 non-null  object 
 3   store     230130 non-null  object 
 4   product   230130 non-null  object 
 5   num_sold  221259 non-null  float64
dtypes: float64(1), int64(1), object(4)
memory usage: 10.5+ MB
# Save 'id' column for submission
test_ids = X_test['id']
# Define the target column
target_column = 'num_sold'
# Select categorical and numerical columns (initial)
categorical_columns = X_train.select_dtypes(include=['object']).columns
numerical_columns = X_train.select_dtypes(exclude=['object']).columns
# Print out column information
print("Target Column:", target_column)
print("\nCategorical Columns:", categorical_columns.tolist())
print("\nNumerical Columns:", numerical_columns.tolist())
Target Column: num_sold

Categorical Columns: ['date', 'country', 'store', 'product']

Numerical Columns: ['id', 'num_sold']
A quick note: compared with previous competitions, this dataset contains only a handful of features. Most of them are categorical, so later we should try different options to achieve the best performance, such as:
- Different ways to encode the features.
- Feature engineering from the existing features, especially the date column, since it can contain useful patterns.
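For example, a few simple calendar features can be derived from the date column; a minimal sketch (the derived column names here are illustrative, not part of the competition data):

```python
import pandas as pd

# Hypothetical mini-frame standing in for the competition data.
df = pd.DataFrame({'date': pd.to_datetime(['2010-01-01', '2010-07-15', '2016-12-31'])})

# Calendar features commonly extracted for sales forecasting.
df['year'] = df['date'].dt.year
df['month'] = df['date'].dt.month
df['day_of_week'] = df['date'].dt.dayofweek          # Monday = 0
df['is_weekend'] = (df['day_of_week'] >= 5).astype(int)
print(df)
```

Features like `is_weekend` often matter for retail sales, since weekend demand tends to differ from weekday demand.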
Descriptive Statistics¶
X_train.describe()
| | id | num_sold |
|---|---|---|
| count | 230130.000000 | 221259.000000 |
| mean | 115064.500000 | 752.527382 |
| std | 66432.953062 | 690.165445 |
| min | 0.000000 | 5.000000 |
| 25% | 57532.250000 | 219.000000 |
| 50% | 115064.500000 | 605.000000 |
| 75% | 172596.750000 | 1114.000000 |
| max | 230129.000000 | 5939.000000 |
for column in categorical_columns:
    num_unique = X_train[column].nunique()
    print(f"'{column}' has {num_unique} unique categories.")
'date' has 2557 unique categories.
'country' has 6 unique categories.
'store' has 3 unique categories.
'product' has 5 unique categories.
# Print top 10 unique value counts for each categorical column
for column in categorical_columns:
    print(f"\nTop value counts in '{column}':\n{X_train[column].value_counts().head(10)}")
Top value counts in 'date':
date
2010-01-01    90
2014-09-05    90
2014-08-29    90
2014-08-30    90
2014-08-31    90
2014-09-01    90
2014-09-02    90
2014-09-03    90
2014-09-04    90
2014-09-06    90
Name: count, dtype: int64

Top value counts in 'country':
country
Canada       38355
Finland      38355
Italy        38355
Kenya        38355
Norway       38355
Singapore    38355
Name: count, dtype: int64

Top value counts in 'store':
store
Discount Stickers       76710
Stickers for Less       76710
Premium Sticker Mart    76710
Name: count, dtype: int64

Top value counts in 'product':
product
Holographic Goose     46026
Kaggle                46026
Kaggle Tiers          46026
Kerneler              46026
Kerneler Dark Mode    46026
Name: count, dtype: int64
print("The mean of columns:")
print(X_train[numerical_columns].mean())
print("\nThe std dev of columns:")
print(X_train[numerical_columns].std())
print("\nThe skewness of columns:")
print(X_train[numerical_columns].skew())
The mean of columns:
id          115064.500000
num_sold       752.527382
dtype: float64

The std dev of columns:
id          66432.953062
num_sold      690.165445
dtype: float64

The skewness of columns:
id         -6.374278e-16
num_sold    1.415373e+00
dtype: float64
Data Cleaning Insights¶
plt.figure(figsize=(15,9))
plt.title("Visualizing Missing Values")
sns.heatmap(X_train.isnull(), cbar=False, cmap=sns.color_palette('magma'), yticklabels=False)
plt.show()
Visual Exploration¶
filtered_columns = [col for col in categorical_columns if col != 'date']
fig, axes = plt.subplots(len(filtered_columns), 2, figsize=(15, 5 * len(filtered_columns)))
for i, column in enumerate(filtered_columns):
    sns.countplot(data=X_train, x=column, ax=axes[i, 0], palette='tab10')
    axes[i, 0].set_title(f'Distribution of {column}', fontsize=14)
    axes[i, 0].set_xlabel(column, fontsize=12)
    axes[i, 0].set_ylabel('Count', fontsize=12)
    sns.despine(ax=axes[i, 0])
    sns.boxplot(data=X_train, x=column, y=target_column, ax=axes[i, 1], palette='tab10')
    axes[i, 1].set_title(f'{column} vs {target_column}', fontsize=14)
    axes[i, 1].set_xlabel(column, fontsize=12)
    axes[i, 1].set_ylabel(target_column, fontsize=12)
    sns.despine(ax=axes[i, 1])
plt.tight_layout()
plt.show()
sns.kdeplot(X_train['num_sold'])
<Axes: xlabel='num_sold', ylabel='Density'>
Conclusions¶
After EDA we can draw some conclusions about our data (not including the time series for now):
- Our dataset contains 6 features. Our target variable is 'num_sold'.
- Categorical features are uniformly distributed, meaning we have an equal number of records for each category.
- The target variable contains missing values, which we should handle carefully later: in a time series context we forecast sales from previous data, so a poor imputation choice can skew our results.
- num_sold contains a high number of outliers across the different categories. This suggests that at specific times of the year (e.g. holidays) the number of sales increases.
- num_sold is positively (right) skewed, so later we may consider applying a transformation to this variable.
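A common choice for a right-skewed, strictly positive target is a log transform; a minimal sketch (the sample values below are made up, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Made-up right-skewed values standing in for num_sold.
sales = pd.Series([5.0, 219.0, 605.0, 1114.0, 5939.0])

log_sales = np.log1p(sales)     # log(1 + x): compresses the long right tail
restored = np.expm1(log_sales)  # exact inverse, applied after predicting

# The inverse transform recovers the original values.
print(np.allclose(restored, sales))
```

If we train on `log1p(num_sold)`, predictions must be mapped back with `expm1` before scoring or submission.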
Time Series EDA¶
Previously we did EDA for the other features, but I decided to make the time series analysis a separate chapter, because EDA for time series differs from EDA for ordinary features.
Analysing Missing Values¶
In this section we'll look at the rows where our target variable is NaN.
missing_num_sold_df = X_train[X_train['num_sold'].isna()]
missing_num_sold_df
| | id | date | country | store | product | num_sold |
|---|---|---|---|---|---|---|
| 0 | 0 | 2010-01-01 | Canada | Discount Stickers | Holographic Goose | NaN |
| 45 | 45 | 2010-01-01 | Kenya | Discount Stickers | Holographic Goose | NaN |
| 90 | 90 | 2010-01-02 | Canada | Discount Stickers | Holographic Goose | NaN |
| 135 | 135 | 2010-01-02 | Kenya | Discount Stickers | Holographic Goose | NaN |
| 180 | 180 | 2010-01-03 | Canada | Discount Stickers | Holographic Goose | NaN |
| ... | ... | ... | ... | ... | ... | ... |
| 229905 | 229905 | 2016-12-29 | Kenya | Discount Stickers | Holographic Goose | NaN |
| 229950 | 229950 | 2016-12-30 | Canada | Discount Stickers | Holographic Goose | NaN |
| 229995 | 229995 | 2016-12-30 | Kenya | Discount Stickers | Holographic Goose | NaN |
| 230040 | 230040 | 2016-12-31 | Canada | Discount Stickers | Holographic Goose | NaN |
| 230085 | 230085 | 2016-12-31 | Kenya | Discount Stickers | Holographic Goose | NaN |
8871 rows × 6 columns
Hmm, at first glance it seems that all the missing values come from the same place. Let's print value_counts for each categorical column.
for c in ['country', 'store', 'product']:
    print(missing_num_sold_df[c].value_counts())
    print()
country
Kenya     4625
Canada    4246
Name: count, dtype: int64

store
Discount Stickers       5179
Stickers for Less       2666
Premium Sticker Mart    1026
Name: count, dtype: int64

product
Holographic Goose     8806
Kerneler                64
Kerneler Dark Mode       1
Name: count, dtype: int64
Indeed! Most of the missing data comes from the Holographic Goose product.
While exploring the discussions, I came to the conclusion that the most suitable approach is to drop the rows with missing data. The reason is that time series forecasting depends heavily on previous values: if we imputed the gaps with, say, the mean, it would skew our results and performance would suffer.
Dropping Missing values¶
Typically this isn't part of EDA, but later we will decompose our time series, so we need to drop the missing values first.
X_train = X_train.dropna(axis=0)
X_train.isna().sum()
id          0
date        0
country     0
store       0
product     0
num_sold    0
dtype: int64
X_train['date'] = pd.to_datetime(X_train['date'])
Line plot of Total Sales Over Time¶
plt.figure(figsize=(28, 6))
X_train.groupby('date')['num_sold'].sum().plot(title='Total Sales Over Time', xlabel='Date', ylabel='Number of Products Sold')
plt.grid()
plt.show()
Decomposing Time Series¶
A time series can be decomposed into a few individual components:
- Trend: the long-term direction or pattern in the data. The trend can be increasing, decreasing, or stable over time.
- Seasonality: regular, repeating patterns within the data. For example, water bottle sales increasing every summer, year after year.
- Cyclic: similar to seasonality, except that the patterns do not repeat at fixed intervals.
- Noise (Residuals): the random variation or irregularity that remains after removing the trend and seasonal components.
Time Series Trend¶
X_data_index = X_train.set_index('date')
from statsmodels.tsa.seasonal import seasonal_decompose
# Aggregate num_sold to one total per day (despite the name, this frame is daily);
# selecting the column first keeps the string columns out of the sum.
monthly_data = X_data_index.resample('D')[['num_sold']].sum()
monthly_data['year'] = monthly_data.index.year
monthly_data['month'] = monthly_data.index.month
trend_data = []
for year in monthly_data['year'].unique():
    yearly_data = monthly_data[monthly_data['year'] == year]
    decomposition = seasonal_decompose(yearly_data['num_sold'], model='additive', period=12)
    trend = pd.DataFrame({
        "trend": decomposition.trend,
        "month": yearly_data['month'],
        "year": year
    })
    trend_data.append(trend)
trend_df = pd.concat(trend_data)
plt.figure(figsize=(20, 10))
sns.lineplot(data=trend_df, x='month', y='trend', hue='year', palette='tab10')
plt.title("Trend Components Over Years")
plt.xlabel("Month")
plt.ylabel("Trend Component")
plt.grid(True)
plt.show()
Time Series Seasonal¶
seasonal_data = []
for year in monthly_data['year'].unique():
    yearly_data = monthly_data[monthly_data['year'] == year]
    decomposition = seasonal_decompose(yearly_data['num_sold'], model='additive', period=12)
    seasonal = pd.DataFrame({
        "seasonal": decomposition.seasonal,
        "month": yearly_data['month'],
        "year": year
    })
    seasonal_data.append(seasonal)
seasonal_df = pd.concat(seasonal_data)
plt.figure(figsize=(20, 10))
sns.lineplot(data=seasonal_df, x='month', y='seasonal', hue='year', palette='tab10')
plt.title("Seasonal Components Over Years")
plt.xlabel("Month")
plt.ylabel("Seasonal Component")
plt.grid(True)
plt.show()
Time Series Residuals¶
resid_data = []
for year in monthly_data['year'].unique():
    yearly_data = monthly_data[monthly_data['year'] == year]
    decomposition = seasonal_decompose(yearly_data['num_sold'], model='additive', period=12)
    resid = pd.DataFrame({
        "resid": decomposition.resid,
        "month": yearly_data['month'],
        "year": year
    })
    resid_data.append(resid)
resid_df = pd.concat(resid_data)
plt.figure(figsize=(20, 10))
sns.lineplot(data=resid_df, x='month', y='resid', hue='year', palette='tab10')
plt.title("Resid Components Over Years")
plt.xlabel("Month")
plt.ylabel("Resid Component")
plt.grid(True)
plt.show()
Stationary Testing¶
Let's check the series' stationarity using the Augmented Dickey–Fuller (ADF) test.
from statsmodels.tsa.stattools import adfuller
def test_stationarity(series, title=''):
    print(f"Results of ADF Test on {title}:")
    result = adfuller(series, autolag='AIC')
    print(f"ADF Statistic: {result[0]}")
    print(f"p-value: {result[1]}")
    if result[1] > 0.05:
        print("Non-stationary")
    else:
        print("stationary")
    for key, value in result[4].items():
        print(f"Critical Value ({key}): {value}")
    print("\n")
test_stationarity(X_train['num_sold'], 'Sold')
Results of ADF Test on Sold:
ADF Statistic: -32.10570496433421
p-value: 0.0
stationary
Critical Value (1%): -3.430379566523776
Critical Value (5%): -2.8615530680191887
Critical Value (10%): -2.5667769556355844
It seems that our time series is stationary (the p-value is 0.0), so no further differencing is required before applying models like ARIMA or SARIMA.
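For reference, had the test not rejected the unit root, first-order differencing would be the usual remedy; a minimal sketch on made-up numbers:

```python
import pandas as pd

# Made-up series with a clear trend (hence non-stationary in levels).
s = pd.Series([10.0, 12.0, 15.0, 19.0, 24.0])

# First-order differencing: y_t - y_{t-1}. This is exactly what the
# d term in ARIMA(p, d, q) applies internally.
diffed = s.diff().dropna()
print(diffed.tolist())  # [2.0, 3.0, 4.0, 5.0]
```

Since our series already passes the ADF test, we can set d small (or zero) when choosing the ARIMA order.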
Forecasting Time Series¶
In this section I will try to forecast the time series using statistical methods (ARIMA, SARIMA, etc.) and machine learning methods. It will be interesting to compare the different approaches.
Statistical approaches¶
ARIMA¶
ARIMA stands for AutoRegressive Integrated Moving Average: an autoregressive term of order p, differencing of order d, and a moving-average term of order q.
Finding ARIMA parameters¶
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf
plot_acf(X_train['num_sold'], lags=40)
plot_pacf(X_train['num_sold'], lags=40)
plt.show()
train_end = '2015-01-01'
val_end = '2017-01-01'
# Filter data based on date ranges
train_data = X_train[X_train['date'] < train_end].copy()  # .copy() avoids SettingWithCopyWarning below
val_data = X_train[(X_train['date'] >= train_end) & (X_train['date'] < val_end)].copy()
# Convert 'date' column to datetime
train_data['date'] = pd.to_datetime(train_data['date'])
val_data['date'] = pd.to_datetime(val_data['date'])
# Set 'date' column as the index for both train and validation data
train_data = train_data.set_index('date')
val_data = val_data.set_index('date')
# Check if the data was correctly processed
print(f"Train data: {train_data.shape}")
print(f"Validation data: {val_data.shape}")
Train data: (157849, 5) Validation data: (63410, 5)
train_data.isna().sum()
id          0
country     0
store       0
product     0
num_sold    0
dtype: int64
from sklearn.metrics import mean_squared_error
from statsmodels.tsa.arima.model import ARIMA
# Fit the ARIMA model on the training dataset
model_train = ARIMA(train_data['num_sold'], order=(1, 1, 1))
model_train_fit = model_train.fit()
# Check for model fit summary
print(model_train_fit.summary())
# Forecast on the test dataset
test_forecast = model_train_fit.get_forecast(steps=len(val_data))
# Check if forecast is not NaN
if test_forecast.predicted_mean.isna().any():
    print("Warning: Some forecasted values are NaN")
# Use .values: predicted_mean carries its own index, and aligning it to
# val_data.index would otherwise fill the series with NaNs.
test_forecast_series = pd.Series(test_forecast.predicted_mean.values, index=val_data.index)
# Check if test_forecast_series contains NaNs
if test_forecast_series.isna().any():
    print("Warning: test_forecast_series contains NaNs")
# # Calculate the mean squared error
# mse = mean_squared_error(val_data['num_sold'], test_forecast_series)
# rmse = mse**0.5
SARIMAX Results
==============================================================================
Dep. Variable: num_sold No. Observations: 157849
Model: ARIMA(1, 1, 1) Log Likelihood -1244476.898
Date: Sun, 05 Jan 2025 AIC 2488959.797
Time: 21:26:33 BIC 2488989.705
Sample: 0 HQIC 2488968.691
- 157849
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
ar.L1 0.4506 0.002 188.554 0.000 0.446 0.455
ma.L1 -0.9997 6.22e-05 -1.61e+04 0.000 -1.000 -1.000
sigma2 4.125e+05 821.364 502.238 0.000 4.11e+05 4.14e+05
===================================================================================
Ljung-Box (L1) (Q): 313.74 Jarque-Bera (JB): 281371.55
Prob(Q): 0.00 Prob(JB): 0.00
Heteroskedasticity (H): 1.06 Skew: 1.93
Prob(H) (two-sided): 0.00 Kurtosis: 8.29
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
Warning: test_forecast_series contains NaNs
test_forecast_series
date
2015-01-01 NaN
2015-01-01 NaN
2015-01-01 NaN
2015-01-01 NaN
2015-01-01 NaN
..
2016-12-31 NaN
2016-12-31 NaN
2016-12-31 NaN
2016-12-31 NaN
2016-12-31 NaN
Name: predicted_mean, Length: 63410, dtype: float64
# Create a plot to compare the forecast with the actual test data
plt.figure(figsize=(14,7))
plt.plot(train_data['num_sold'], label='Training Data')
plt.plot(val_data['num_sold'], label='Actual Data', color='orange')
plt.plot(test_forecast_series, label='Forecasted Data', color='green')
plt.fill_between(val_data.index,
test_forecast.conf_int().iloc[:, 0],
test_forecast.conf_int().iloc[:, 1],
color='k', alpha=.15)
plt.title('ARIMA Model Evaluation')
plt.xlabel('Date')
plt.ylabel('Number of Products Sold')
plt.legend()
plt.show()